## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
I begin by examining the data set and the variables it contains.
This report explores a data set of 4,898 white wines with 13 columns, 11 of which quantify the chemical properties of each wine.
## [1] 4898
## [1] 13
There are 4,898 rows and 13 columns. I have also removed the serial number column ‘X’, which is not meaningful for our analysis.
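Dropping the serial-number column can be sketched as follows. This is a minimal illustration on a toy data frame built from a few of the values printed above; the data frame name `ww` is taken from the model calls later in this report, and the real analysis would of course operate on the full data set.

```r
# Toy data frame using the first six rows printed above (subset of columns).
ww <- data.frame(
  X             = 1:6,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  alcohol       = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality       = rep(6L, 6)
)

# Remove the serial-number column by name; all other columns are kept.
ww$X <- NULL

dim(ww)    # rows and columns after the removal
names(ww)  # 'X' no longer appears
```

Removing the column by name rather than by position keeps the step safe if the column order ever changes.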
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The data set contains only numeric and integer values. We might have to change some of the variable types for our analysis (specifically ‘quality’).
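One possible way to change the type of ‘quality’ is to turn it into an ordered factor. The sketch below is an assumption about how this could be done, not necessarily the conversion used in this report; the level range 3 to 9 is the minimum and maximum reported by the summary of the data, and `quality_f` is a hypothetical name.

```r
# First six quality scores from the data set.
quality <- rep(6L, 6)

# Ordered factor: the levels keep their natural order (3 < 4 < ... < 9),
# which is convenient for boxplots and faceting by quality.
quality_f <- factor(quality, levels = 3:9, ordered = TRUE)

is.ordered(quality_f)
levels(quality_f)
```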
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
This gives us the central tendencies of all variables.
In the following section I evaluate each variable to examine its distribution.
This is approximately a normal curve and gives a fair understanding of the distribution. The distribution is unimodal, with fixed acidity peaking around 6.8. There were some outliers below a fixed acidity of 4 and above 10, which have been removed. According to Waterhouse, most wines have a tartaric acid value between 1 g/dm^3 and 4 g/dm^3. Is there a strong correlation between fixed acidity and pH value? Now let’s explore what the plots look like for other variables.
This is also unimodal, peaking around a volatile acidity value of 0.28. Waterhouse claims that the average acetic acid value is less than 400 mg/L, which is in line with our data set. The legal limit of acetic acid in the US for white wine is 1.1 g/dm^3. Too much acetic acid can result in unpleasant aromas. In addition, both acetic acid and acetaldehyde are toxic to Saccharomyces cerevisiae and may lead to stuck fermentations.
This distribution is also roughly normal, with citric acid peaking around 0.3. Why is there a sudden peak at around 0.49?
According to Waterhouse, one would expect to see 0 to 500 mg/L of citric acid. This might be why there is a peak at around 0.49-0.5.
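The comparison between the literature values (in mg/L) and the data set only works once the units are aligned. The helper below is a small sketch, assuming the data set reports the acids in g/dm^3 as the units quoted elsewhere in this report suggest; the function name is hypothetical.

```r
# 1 L equals 1 dm^3 and 1 g equals 1000 mg, so mg/L -> g/dm^3 is a division by 1000.
mg_per_l_to_g_per_dm3 <- function(x) x / 1000

mg_per_l_to_g_per_dm3(500)  # upper end of the expected citric acid range
mg_per_l_to_g_per_dm3(400)  # average acetic acid bound quoted from Waterhouse
```

On that scale the 500 mg/L upper bound sits at 0.5, right where the secondary peak appears.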
I observe a long-tailed distribution. There are some extreme outliers around the 30s and 70s, which have been removed from the graph. According to winefolly.com:
- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine
We can conclude that most of the wines in the data set are dry wines.
A wine is dry when the yeast consumes all the available sugar and produces ethanol as a by-product. This is why some sweet wines have less alcohol than their dry counterparts. We can look at the correlation between residual sugar content and alcohol. Is this an inverse relationship?
The transformed distribution is bimodal, with one peak around 4 and another around 9. What do these peaks represent?
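A log10 transformation of a long-tailed variable can be sketched in base R as follows. The values are the first ten residual sugar readings shown in the structure output above; the original plot may well have been produced with a different plotting library, so treat this only as an illustration of the transformation itself.

```r
# First ten residual sugar values from the data set (g/dm^3).
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail so that structure at low values shows up.
log_sugar <- log10(residual.sugar)

# Compute the histogram bins without drawing; 'breaks' is only a suggestion to hist().
h <- hist(log_sugar, breaks = 10, plot = FALSE)
h$counts

# Back-transforming recovers the original scale.
10^log_sugar
```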
The majority of the values lie between 0 and 1. This is also roughly a normal distribution, with a peak at around 0.04. Most wines have a salt content of less than 0.1.
Free sulfur dioxide looks like a normal distribution with its peak at approximately 30. Most wines have a free sulfur dioxide content of less than 100.
Total sulfur dioxide follows a normal distribution with a peak around the 120s. Sulfites are used to preserve wines. Most people can easily digest sulfites, but some people have extreme allergic reactions to them. According to Waterhouse, the average sulfite content in wine is around 80 mg/L, which is almost in line with the data set. SO2 content above 50 is detectable in the nose and taste of wine. Given this, there are many wines in the data set where SO2 might become evident in the nose and taste.
Density seems to follow a normal distribution with peak at nearly 0.992. There are a few outliers as well.
pH seems to follow a normal distribution with a peak at nearly 3.15. According to Dr. Vinny’s post on winespectator.com, the ideal pH value for white wines is around 3.0-3.4.
Sulphates follow a normal distribution with a peak at 0.5. Potassium sulphate is the additive that contributes to sulfur dioxide gas, which acts as an antimicrobial and antioxidant.
The white wines have an alcohol content between 8.5% and 14%, with most values concentrated between 9% and 10.5%.
Most of the wines are given a quality score of 6. These values might be biased in many ways, as they are sensory data and completely subjective. The scores might differ if a different set of experts were used.
Let’s look at all variable values by quality:
The fixed acidity (tartaric acid) for wines of different quality peaks between 6 and 8 g/L.
This does not give any particular insight. The volatile acidity of all quality levels peaks around 0.2.
The citric acid graph also does not provide any particular insight. There’s a peak around 0.5, which was examined earlier.
This plot does not give any particular insight.
There’s a general peak between 20 and 40. This does not give us any key takeaways.
This gives us no particular insights.
This gives us no particular insights.
This gives us no particular insights.
This is the only plot which offers us some insights in this section. As realised throughout our analysis, alcohol has a meaningful correlation with quality.
I am going to create a new variable called Dryness, based on the literature available online.
Most of the wines in our data set belong to the dry category.
The data set consists of 4,898 variants of the Portuguese white wine “Vinho Verde”, with measurements of eleven chemical properties:
- Fixed Acidity: acid that contributes to the conservation of wine.
- Volatile Acidity: amount of acetic acid in wine; at high levels it can lead to an unpleasant taste of vinegar.
- Citric Acid: found in small amounts, can add “freshness” and flavor to wines.
- Residual Sugar: amount of sugar remaining after the end of the fermentation.
- Chlorides: amount of salt in the wine.
- Free Sulfur Dioxide: prevents the growth of microbes and the oxidation of the wine.
- Total Sulfur Dioxide: at higher concentrations it becomes evident in the aroma and taste of the wine.
- Density: density of the wine, which depends on the percentage of alcohol and the amount of sugar.
- pH: describes how acidic or basic a wine is on a scale of 0 to 14.
- Sulphates: additive that acts as an antimicrobial and antioxidant.
- Alcohol: percentage of alcohol present in the wine.
And a sensorial property:
- Quality: grade between 0 and 10 given by specialists.
Observations:
- Most wines have medium quality (5 and 6).
- There’s no evident predictor of quality from the univariate analysis.
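The observation about medium quality can be quantified from the per-quality counts that this report prints in its summary table by quality (column `n`). The sketch below simply re-enters those seven counts by hand.

```r
# Number of wines per quality score, copied from the report's per-quality summary.
n <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198, `7` = 880, `8` = 175, `9` = 5)

sum(n)                     # should match the 4,898 rows of the data set

share <- n / sum(n)        # proportion of wines at each score
round(share, 3)

sum(share[c("5", "6")])    # combined share of the two medium scores
names(which.max(n))        # most frequent score
```

Scores 5 and 6 together account for roughly three quarters of all wines, which is worth keeping in mind when reading the later plots: the extreme scores rest on very few observations.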
The main feature of interest in the data set is quality, which is also our dependent variable. I’d like to determine which features are best for predicting the quality of wine. I suspect some combination of the chemical property variables can be used to build a predictive model of the quality of white wines.
It is very difficult to predict quality from the given variables at first glance. I did not notice any significant relationship even after facet wrapping various variables by quality. Perhaps I could take the relationship of residual sugar with the other properties as a starting point for further investigation.
I created a new variable called dryness, based on the residual sugar content as follows:
- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine
Most of the wines are Dry in nature.
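A binned variable like this can be built with `cut()`. The following is a minimal sketch rather than the exact code behind this report: the sample values are a few residual sugar readings from the data set (including its minimum, 0.6, and maximum, 65.8), and the handling of values that fall exactly on a boundary is an assumption, since the category list does not specify it.

```r
# A few residual sugar values (g/dm^3), including the data set's minimum and maximum.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 0.6, 65.8)

dryness <- cut(
  residual.sugar,
  breaks = c(0, 1, 10, 35, 120, 220),
  labels = c("Bone Dry", "Dry", "Off-Dry", "Sweet Wine", "Very Sweet Wine"),
  right = FALSE,          # assumption: intervals are [lower, upper), so exactly 1 g/L counts as Dry
  include.lowest = TRUE   # with right = FALSE this also includes the top boundary, 220
)

table(dryness)
```

Because the labels are given in increasing order of sweetness, the resulting factor levels are already sorted sensibly for plotting.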
It was necessary to remove anomalies and extreme values in some cases for better visualisations. Some properties, like residual sugar and density, had extreme values. In addition, the residual sugar of the white wines presented a long-tailed distribution. I used a log10 transformation and got a bimodal distribution.
I am going to start with a clean pairs.panels plot to examine some key relationships, plots and correlations between variables.
Now I am going to draw scatter plots to analyse the relationship between our feature of focus (quality) and the chemical properties. I will also jitter these plots to give a better perspective.
An initial decreasing trend followed by an increasing trend was observed. For wines of quality below 6, the relationship with alcohol is negative.
After removing outliers, this looks like a negative relationship. According to the literature available online, I was able to confirm this relationship.
After removing outliers, this looks like a negative relationship. According to the literature available online, I was able to confirm this relationship.
No linear relationship was observed. There are no takeaways from this plot.
No linear relationship was observed. There are no takeaways from this plot.
A decreasing trend was observed. This is also logically sound, since fermentation of sugar results in more alcohol. From our analysis, it is fair to state that the higher the alcohol content, the better the quality of the wine.
No linear relationship was observed. There are no takeaways from this plot.
No linear relationship was observed. There are no takeaways from this plot.
A decreasing trend was observed.
No linear relationship was observed. There are no takeaways from this plot.
No linear relationship was observed. There are no takeaways from this plot.
A positive trend was observed.
According to Waterhouse, the total acidity is the sum of fixed and volatile acidity.
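That definition translates directly into a derived column. The sketch below uses the first three rows of the data set; the column name `total.acidity` is a hypothetical choice for illustration.

```r
# First three rows of the two acidity columns (g/dm^3).
ww <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28)
)

# Total acidity as the element-wise sum of fixed and volatile acidity.
ww$total.acidity <- ww$fixed.acidity + ww$volatile.acidity

ww
```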
It is clear from above that alcohol has the strongest correlation with quality. Here are the noteworthy correlations involving quality. I had to utilize the integer version of the quality variable in order to calculate the correlations.
- Quality and alcohol: 0.436
- Quality and density: -0.307
However, both these correlations can’t be considered strong.
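Using the integer version of quality matters if the variable has been converted to a factor, because converting a factor straight to integer returns its internal level codes rather than the scores. The sketch below assumes quality was stored as a factor with levels 3 to 9; the five scores are illustrative values, not rows from the data set.

```r
# Illustrative quality scores stored as an ordered factor (levels 3..9).
quality_f <- factor(c(6, 5, 7, 3, 9), levels = 3:9, ordered = TRUE)

# Pitfall: this returns the position of each level, not the score itself.
as.integer(quality_f)

# Going through character first recovers the original scores,
# which can then be passed to cor().
as.integer(as.character(quality_f))
```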
Let’s take a look at boxplots involving quality.
Only alcohol and density have a meaningful relationship with quality score. I have arranged both these plots below.
I am going to analyse quality by taking into consideration central tendencies of density to see this relationship better.
## # A tibble: 7 x 6
## quality mean_density median_density min_density max_density n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 0.9948840 0.994425 0.99110 1.00010 20
## 2 4 0.9942767 0.994100 0.98920 1.00040 163
## 3 5 0.9952626 0.995300 0.98722 1.00241 1457
## 4 6 0.9939613 0.993660 0.98758 1.03898 2198
## 5 7 0.9924524 0.991760 0.98711 1.00040 880
## 6 8 0.9922359 0.991640 0.98713 1.00060 175
## 7 9 0.9914600 0.990300 0.98965 0.99700 5
The median data again show that as quality increases, density values decrease.
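The trend in the table can be checked with a rank correlation between the quality score and the median density of each group. This is a sketch on the seven summary rows printed above, so it describes the group medians only, not the wine-level correlation reported earlier.

```r
# Median density per quality score, copied from the summary table above.
by_quality <- data.frame(
  quality        = 3:9,
  median_density = c(0.994425, 0.994100, 0.995300, 0.993660,
                     0.991760, 0.991640, 0.990300)
)

# Spearman's rho measures how consistently the medians fall as quality rises.
rho <- cor(by_quality$quality, by_quality$median_density, method = "spearman")
rho

# Step-to-step change in the median; a positive entry marks a break in the trend.
diff(by_quality$median_density)
```

The rank correlation is strongly negative, and the only step where the median density rises is from quality 4 to quality 5.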
In addition to evaluating the correlations related to quality, I also want to probe how other variables work with each other. Here are the correlations of note that do not involve quality:
- Total sulfur dioxide and residual sugar: 0.401
- Total sulfur dioxide and free sulfur dioxide: 0.616
- Total sulfur dioxide and alcohol: -0.449
- Density and residual sugar: 0.839
- Alcohol and density: -0.780
- Residual sugar and alcohol: -0.451
- Fixed acidity and pH: -0.426
Density, alcohol, and residual sugar all appear to be strongly correlated to each other, so I am going to take a closer look at those plots.
The correlations are very evident in the charts shown above. Sugar must be denser than the other ingredients in the wine, because higher density levels imply a higher sugar quantity. Similarly, more alcohol seems to imply lower density. Lastly, alcohol and sugar may offset each other during the wine-making process, because wines with lower levels of alcohol tend to have higher levels of sugar (and vice versa).
I also want to make a special note about pH levels and acidity. All three acidity values have a strong correlation with pH. This is logical, as a higher pH value corresponds to lower acidity.
I evaluated all the variables against our main feature variable, quality, and observed that alcohol content has the strongest relationship with quality, although the correlation is still loose. Another variable that may slightly influence quality is density.
Initially, as alcohol content increases, quality decreases. Subsequently when alcohol content increases, quality increases. This is not a linear model as represented by the smoothing line.
I discovered strong correlations between alcohol, residual sugar and density. As alcohol content increases, density tends to decrease rather linearly. Furthermore, as residual sugar increases, density also increases. A linear model fits this well. Finally, as the residual sugar level rises, the alcohol level decreases. This was clarified by the literature available online. I mainly referred to literature provided by Waterhouse.
The strongest correlation was seen between Density and Residual Sugar.
You can see that the graph generally gets darker to the right, and the correlations between alcohol and quality and between density and quality are evident.
Given the same quality, wine without a sulfur aroma is more likely to have a higher alcohol level. For instance, among wines with a quality score of 6 and no sulfur smell, the median alcohol by volume is 10.6%, compared with 9.6% among wines with the same quality score and an evident sulfur smell (represented by the blue boxplots). Therefore, you are more likely to get a better quality wine if the sulfur level is unnoticeable.
I am going to try to construct a linear model to predict the quality score based on the chemical properties.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = ww)
## m2: lm(formula = quality ~ alcohol + density, data = ww)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = ww)
## m4: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity,
## data = ww)
## m5: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity, data = ww)
## m6: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH, data = ww)
## m7: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide, data = ww)
## m8: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar),
## data = ww)
## m9: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) +
## citric.acid, data = ww)
## m10: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) +
## citric.acid + free.sulfur.dioxide, data = ww)
## m11: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) +
## citric.acid + free.sulfur.dioxide + sulphates, data = ww)
##
## ==================================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** -21.150*** -31.387*** -47.652*** -47.870*** -43.543*** 41.731*** 42.639*** 37.700*** 47.757***
## (0.098) (6.165) (6.162) (6.355) (6.195) (6.222) (6.510) (11.223) (11.284) (11.294) (11.437)
## alcohol 0.313*** 0.360*** 0.343*** 0.356*** 0.405*** 0.406*** 0.408*** 0.310*** 0.308*** 0.310*** 0.296***
## (0.009) (0.015) (0.015) (0.015) (0.015) (0.015) (0.015) (0.018) (0.019) (0.019) (0.019)
## density 24.728*** 23.671*** 34.437*** 50.909*** 51.237*** 46.805*** -39.975*** -40.902*** -36.049** -46.226***
## (6.079) (6.074) (6.293) (6.137) (6.199) (6.501) (11.351) (11.414) (11.422) (11.567)
## chlorides -2.382*** -2.421*** -1.323* -1.334* -1.399** -0.762 -0.808 -0.831 -0.818
## (0.558) (0.555) (0.539) (0.540) (0.541) (0.541) (0.544) (0.542) (0.541)
## fixed.acidity -0.087*** -0.101*** -0.103*** -0.103*** -0.027 -0.029 -0.020 -0.014
## (0.014) (0.014) (0.015) (0.015) (0.017) (0.017) (0.017) (0.017)
## volatile.acidity -2.085*** -2.088*** -2.112*** -2.117*** -2.101*** -1.981*** -1.953***
## (0.110) (0.111) (0.111) (0.110) (0.112) (0.114) (0.114)
## pH -0.031 -0.042 0.326*** 0.332*** 0.343*** 0.317***
## (0.081) (0.081) (0.090) (0.090) (0.090) (0.090)
## total.sulfur.dioxide 0.001* 0.000 0.000 -0.001 -0.001*
## (0.000) (0.000) (0.000) (0.000) (0.000)
## log(residual.sugar) 0.225*** 0.226*** 0.210*** 0.232***
## (0.024) (0.024) (0.024) (0.025)
## citric.acid 0.075 0.057 0.037
## (0.097) (0.096) (0.096)
## free.sulfur.dioxide 0.004*** 0.004***
## (0.001) (0.001)
## sulphates 0.502***
## (0.099)
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.195 0.202 0.256 0.256 0.257 0.270 0.270 0.274 0.278
## adj. R-squared 0.190 0.192 0.195 0.201 0.255 0.255 0.256 0.269 0.269 0.272 0.276
## sigma 0.797 0.796 0.795 0.792 0.764 0.764 0.764 0.757 0.757 0.755 0.754
## F 1146.395 583.290 396.315 309.222 336.912 280.734 241.554 225.827 200.787 184.336 170.797
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5822.011 -5802.684 -5629.932 -5629.861 -5627.322 -5584.491 -5584.187 -5570.814 -5557.825
## Deviance 3112.257 3101.773 3090.247 3065.956 2857.136 2857.053 2854.093 2804.611 2804.262 2788.992 2774.238
## AIC 11684.782 11670.255 11654.021 11617.368 11273.865 11275.722 11272.645 11188.982 11190.373 11165.629 11141.649
## BIC 11704.272 11696.241 11686.504 11656.348 11319.341 11327.694 11331.114 11253.948 11261.836 11243.588 11226.105
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## ==================================================================================================================================================================================
No combination of variables could give a good model to predict the quality score. The R-squared value is very low even after including all variables, so the fit is not strong.
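How little the extra predictors add can be read off the table by recomputing a few of its summary statistics from their standard definitions for ordinary least squares. The sketch below re-enters the reported values for the smallest and largest models (`m1` and `m11`) by hand; it does not refit anything.

```r
n <- 4898  # number of wines

# Reported values for the one-predictor and the eleven-predictor model.
fits <- data.frame(
  model    = c("m1", "m11"),
  p        = c(1, 11),                  # number of predictors
  r2       = c(0.190, 0.278),
  deviance = c(3112.257, 2774.238),     # residual sum of squares
  loglik   = c(-5839.391, -5557.825)
)

# Adjusted R-squared penalises R-squared for the number of predictors.
fits$adj_r2 <- 1 - (1 - fits$r2) * (n - 1) / (n - fits$p - 1)

# Residual standard error from the residual sum of squares.
fits$sigma <- sqrt(fits$deviance / (n - fits$p - 1))

# AIC with p slopes, one intercept and one variance parameter.
fits$aic <- -2 * fits$loglik + 2 * (fits$p + 2)

fits

# Gain in explained variance from adding ten more predictors.
diff(fits$r2)
```

The recomputed values agree with the table up to rounding: ten additional predictors raise R-squared by less than 0.09 and reduce the residual standard error only from about 0.80 to about 0.75 quality points.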
In this section, I tried to visualise some of the variables more concisely and precisely. Some of the insights into relationships between alcohol, density and residual sugars were strengthened.
It is interesting to note that the chemical property trends of wines of quality 5 and below are almost the inverse of the trends of wines of quality 6 and above. This might be due to the influence of an unknown variable which is not given in the data set. Alternatively, there might be something that I have missed. The use of artificial flavouring and other chemical agents might give the low quality wines the same chemical properties but different tastes.
I tried to fit a linear model to the data set to predict the quality of white wine based on the features provided.
The model grew stronger as I added more features. However, a linear model may not be the best way to represent this data. The R-squared values were too low and the residuals were high. Using all the features provided is not very different from using only alcohol as a predictor, which was tried in the bivariate section. This might be because some of the features are correlated with each other.
To improve the model, we might need to introduce new features or a new way to transform the data. Moreover, there might be a better method than a linear model to predict quality.
The strongest correlation observed between the feature of interest and any other feature was with alcohol, at 0.436. This relationship can be visualised using the above chart. We can see that the concentration of points increases from left to right. That means that as the alcohol level increases, quality also increases. Taking a closer look at the box plot, we realise that the increasing trend is not steady. Between quality 3 and 5 it is a negative relationship. It is also safe to assume that beyond 12.5% alcohol content the quality of wine will decrease, because the alcohol taste will overpower the native wine taste.
This is a good visualisation of the relationship between alcohol, density and quality. I have removed the outliers to make the visualisation better.
Alcohol and density have a negative relationship. That means that as alcohol content increases, density decreases. Also, the better quality wines are concentrated at the top left of the graph. The graph disperses in the middle and converges at the bottom right. This also hints that as density increases, the quality of wine tends to decrease.
The White Wines data set contains information on 4,898 samples of Portuguese white wine (Vinho Verde) across 11 chemical properties and a special feature called quality score, which was evaluated by wine experts. I started by exploring individual variables in the data set and went on to investigate the relationship between each chemical property and quality, which was chosen as the main feature in my analysis. Eventually, I tried to create a linear model to predict the quality of wine given the other chemical properties.
There was a trend between quality and alcohol, but the other variables did not produce a strong correlation with quality. However, the variables were more or less strongly correlated with each other. This might also be the reason why I was not able to come up with a linear model that predicts the quality score straight away. Transformations might have worked, but I could not identify a direction to go forward with. Alternatively, the absence of other features in the data set might also be a reason why I wasn’t able to produce a good linear model in my analysis.
Some limitations of this data include missing features such as glycerol, tannin, amino acids, minerals, etc. Another limitation is that the quality score is a very subjective indicator. A more robust database could have produced a better model.
Having said that, this is my first project in R. I have so much to learn, and I am sure that as the course progresses I will be able to deliver better.